Marketing Analytics Process

Predictive Modeling Workflow

Supervised Learning and Prediction

We’ve already spent time with supervised learning, where the model includes an outcome variable. Specifically, we dealt with regression and classification. How are they different?

We used supervised learning for inference (i.e., to understand the underlying data generating process), but now we only care about prediction. So instead of searching for the one best model for inference, we’ll fit many models and compare which one predicts best.

Import Data

Let’s import and work with some new data.

# Load packages.
library(tidyverse)
library(tidymodels)
library(dbplyr)
library(DBI)

# Set a simulation seed.
set.seed(42)

The password is practicemakes. It is bad form to save a password in your code, which is why we prompt for it with rstudioapi::askForPassword() instead.

# Connect to the database.
con <- dbConnect(
  RPostgreSQL::PostgreSQL(),
  dbname = "analyticsdb",
  host = "analyticsdb.ccutuqssh92k.us-west-2.rds.amazonaws.com",
  port = 55432,
  user = "quantmktg",
  password = rstudioapi::askForPassword("Database password")
)

# Look at the available data tables.
dbListTables(con)

# Import from the database.
roomba_survey <- tbl(con, "roomba_survey") |>
  collect()

# Disconnect.
dbDisconnect(con)

# Write data locally.
roomba_survey |> 
  select(-row.names) |> 
  write_csv(here::here("Data", "roomba_survey.csv"))

We can use the survey as a data dictionary.

# Answers to S1?
roomba_survey |> 
  count(S1)
## # A tibble: 3 × 2
##      S1     n
##   <dbl> <int>
## 1     1    40
## 2     3    63
## 3     4   229

Outcome Variable

Previously we were a little lazy and did some feature engineering (i.e., preprocessing) of the outcome variable at the same time as the predictors. We can run into problems that way. Get your outcome variable ready first and leave feature engineering to the features (i.e., predictors).

# Wrangle S1 into segment.
roomba_survey <- roomba_survey |> 
  rename(segment = S1) |> 
  mutate(
    segment = case_when(
      segment == 1 ~ "own",
      segment == 3 ~ "shopping",
      segment == 4 ~ "considering"
    ),
    segment = factor(segment)
  )
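
One way to verify the recode (a quick sketch): the counts of segment should match the S1 counts from the data dictionary step above (1 mapped to own, 3 to shopping, 4 to considering).

```r
# Check the recode: counts should line up with the S1 counts.
roomba_survey |> 
  count(segment)
```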

Data Splitting by Strata

Once again, one of the first things we need to do is split the data. To make sure the testing data includes every category of our outcome variable, use the strata argument.

# Split data based on segment.
roomba_split <- initial_split(roomba_survey, prop = 0.75, strata = segment)

How could you check and see that this worked?
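
One way to answer that question (a sketch): compare the segment proportions in the training and testing sets, which should be close to each other when stratified sampling worked.

```r
# Compare segment proportions across the training set...
training(roomba_split) |>
  count(segment) |>
  mutate(prop = n / sum(n))

# ...and the testing set. The prop columns should be similar.
testing(roomba_split) |>
  count(segment) |>
  mutate(prop = n / sum(n))
```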

Decision Trees

We might be tempted to use a logistic regression since this is a classification problem, but why wouldn’t it work for this outcome?

Let’s use a decision tree.

  • Instead of fitting a line, split the data based on a decision rule using one of the predictors.
  • Keep adding decision rules based on more predictors to split the data further.
  • Use the resulting regions to classify (or minimize the residual sum of squares if it’s regression).

Clear as mud? Whiteboard!
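
The bullet points above amount to nested if/else rules. Here’s a hand-written sketch of what a fitted tree’s prediction logic might look like — the variable names and cutpoints are entirely made up for illustration, not from a fitted tree:

```r
# Hypothetical decision rules (illustrative only, not a fitted tree).
predict_segment <- function(attitude_1, attitude_5) {
  if (attitude_1 < 3) {
    "own"
  } else if (attitude_5 >= 4) {
    "shopping"
  } else {
    "considering"
  }
}

predict_segment(attitude_1 = 2, attitude_5 = 5)
# "own"
```

A fitted decision tree learns which predictors, cutpoints, and nesting order to use from the training data.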

Set the model, engine, and mode (since a decision tree can do regression or classification).

# Set the model, engine, and mode.
roomba_model <- decision_tree() |> 
  set_engine(engine = "rpart") |> 
  set_mode("classification")

roomba_model
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Without fit(), the setup here is just a list of instructions. Where have we seen this before?

Workflows

A workflow is a tidymodels object that combines the instructions of a recipe and a model.

# Create a workflow.
roomba_wf_01 <- workflow() |> 
  add_formula(
    segment ~ CleaningAttitudes_1 + CleaningAttitudes_2 + CleaningAttitudes_3 + 
      CleaningAttitudes_4 + CleaningAttitudes_5 + CleaningAttitudes_6 + 
      CleaningAttitudes_7 + CleaningAttitudes_8 + CleaningAttitudes_9 + 
      CleaningAttitudes_10 + CleaningAttitudes_11
  ) |> 
  add_model(roomba_model)

roomba_wf_01
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## segment ~ CleaningAttitudes_1 + CleaningAttitudes_2 + CleaningAttitudes_3 + 
##     CleaningAttitudes_4 + CleaningAttitudes_5 + CleaningAttitudes_6 + 
##     CleaningAttitudes_7 + CleaningAttitudes_8 + CleaningAttitudes_9 + 
##     CleaningAttitudes_10 + CleaningAttitudes_11
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Fit a Workflow

We can fit the workflow itself since it includes the formula and model instructions.

# Fit a workflow.
wf_fit_01 <- fit(roomba_wf_01, data = training(roomba_split))

wf_fit_01
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## segment ~ CleaningAttitudes_1 + CleaningAttitudes_2 + CleaningAttitudes_3 + 
##     CleaningAttitudes_4 + CleaningAttitudes_5 + CleaningAttitudes_6 + 
##     CleaningAttitudes_7 + CleaningAttitudes_8 + CleaningAttitudes_9 + 
##     CleaningAttitudes_10 + CleaningAttitudes_11
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 248 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 248 77 considering (0.6895161 0.1209677 0.1895161) *
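
If you want to inspect or plot the underlying rpart object directly, extract_fit_engine() pulls it out of the fitted workflow. A sketch — rpart.plot is an assumed extra package, and plotting is most useful once the tree actually has splits to show:

```r
# Extract the underlying rpart fit from the workflow.
tree_fit <- wf_fit_01 |>
  extract_fit_engine()

# Plot the tree (most informative once the tree actually splits).
rpart.plot::rpart.plot(tree_fit)
```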

Evaluate Predictive Fit

Similarly, we can evaluate predictive fit using the fitted workflow. For classification, there are a lot of possible measures of predictive fit, but accuracy is a natural one to use.

# Compute model accuracy.
wf_fit_01 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  accuracy(truth = segment, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.690
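
Accuracy isn’t the only option. As a sketch, yardstick’s metric_set() bundles several classification metrics so they can be computed in one call:

```r
# Bundle several classification metrics.
class_metrics <- metric_set(accuracy, kap, sens, spec)

# Compute them all on the testing data.
wf_fit_01 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  class_metrics(truth = segment, estimate = .pred_class)
```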

We can also compute a confusion matrix.

# Compute a confusion matrix.
wf_fit_01 |>
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  conf_mat(truth = segment, estimate = .pred_class)
##              Truth
## Prediction    considering own shopping
##   considering          58  10       16
##   own                   0   0        0
##   shopping              0   0        0

Feature Engineering

We can certainly do better. Notice that the fitted tree never split: it is a single root node, so it predicts considering for every observation, which is exactly what the confusion matrix shows. What if we included some demographic predictors instead? We’ll still want to dummy code them.

# Build a recipe.
roomba_recipe <- training(roomba_split) |>
  recipe(
    segment ~ D1Gender + D2HomeType + D3Neighborhood + D4MaritalStatus
  ) |>
  step_dummy(all_nominal(), -all_outcomes())

Note that we haven’t used prep() – that’s now part of executing the workflow.
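
For reference, here’s a sketch of what running the recipe manually would look like — the workflow does this for us:

```r
# Estimate the recipe steps on the training data.
roomba_prepped <- prep(roomba_recipe)

# Apply the steps; new_data = NULL returns the prepared training data.
bake(roomba_prepped, new_data = NULL) |>
  glimpse()
```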

Update, Fit, and Evaluate

We didn’t have a recipe() before, so we had to use add_formula(). Let’s get rid of it and update our workflow with our new recipe.

# Update the workflow.
roomba_wf_02 <- roomba_wf_01 |> 
  remove_formula() |>
  add_recipe(roomba_recipe)

roomba_wf_02
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

By fitting a workflow that includes a recipe, prep(), bake(), and fit() are all executed in one step.

# Fit a second workflow.
wf_fit_02 <- fit(roomba_wf_02, data = training(roomba_split))

wf_fit_02
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 248 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 248 77 considering (0.6895161 0.1209677 0.1895161) *

By predicting with a fitted workflow, bake() and predict() are executed in one step.

# Compute model accuracy.
wf_fit_02 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  accuracy(truth = segment, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.690

# Compute a confusion matrix.
wf_fit_02 |>
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  conf_mat(truth = segment, estimate = .pred_class)
##              Truth
## Prediction    considering own shopping
##   considering          58  10       16
##   own                   0   0        0
##   shopping              0   0        0

Growing a Decision Tree

Workflows help with the fact that we’re iterating on a lot of models as we try different predictors and changes to feature engineering. But we haven’t said anything about hyperparameters.

  • tree_depth: the maximum depth of the tree (default is 30)
  • min_n: the minimum number of data points required in a node for it to be split further (default is 20)

# Specify the model, engine, and mode.
roomba_model <- decision_tree(tree_depth = 5) |> 
  set_engine(engine = "rpart") |> 
  set_mode("classification")

roomba_model
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   tree_depth = 5
## 
## Computational engine: rpart

# Update the workflow.
roomba_wf_03 <- roomba_wf_02 |> 
  update_model(roomba_model)

roomba_wf_03
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   tree_depth = 5
## 
## Computational engine: rpart

# Fit the workflow.
wf_fit_03 <- fit(roomba_wf_03, data = training(roomba_split))

wf_fit_03
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 248 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 248 77 considering (0.6895161 0.1209677 0.1895161) *

# Compute model accuracy.
wf_fit_03 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  accuracy(truth = segment, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.690

# Compute a confusion matrix.
wf_fit_03 |>
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  conf_mat(truth = segment, estimate = .pred_class)
##              Truth
## Prediction    considering own shopping
##   considering          58  10       16
##   own                   0   0        0
##   shopping              0   0        0
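
Setting tree_depth by hand is a start, but tidymodels can also search over hyperparameter values with tune(). A sketch, assuming 5-fold cross-validation on the training data (the exercise below asks you to try this yourself):

```r
# Resample the training data for tuning.
roomba_folds <- vfold_cv(training(roomba_split), v = 5, strata = segment)

# Mark the hyperparameters for tuning.
tune_model <- decision_tree(tree_depth = tune(), min_n = tune()) |> 
  set_engine(engine = "rpart") |> 
  set_mode("classification")

# Swap the tuning model into the existing workflow.
roomba_wf_tune <- roomba_wf_02 |> 
  update_model(tune_model)

# Evaluate a grid of candidate hyperparameter values.
tune_results <- tune_grid(roomba_wf_tune, resamples = roomba_folds, grid = 10)

# Inspect the best candidates and finalize the workflow.
show_best(tune_results, metric = "accuracy")
final_wf <- finalize_workflow(
  roomba_wf_tune,
  select_best(tune_results, metric = "accuracy")
)
```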

Wrapping Up

Summary

  • Demonstrated splitting data by strata.
  • Discussed decision trees.
  • Walked through building, using, and updating workflows.
  • Tried adjusting a decision tree hyperparameter.

Next Time

  • Why have one decision tree when you can have a forest?

Supplementary Material

  • Tidy Modeling with R Chapter 7

Artwork by @allison_horst

Exercise 17

  1. Use case_when() to combine the own and shopping segments into a single category.
  2. Use workflows to combine the cleaning attitudes and demographics we’ve used in class and fit both a logistic regression and a decision tree. Which one has better predictive fit?
  3. Try to tune the decision tree hyperparameters to improve its predictive fit.
  4. Render the Quarto document into Word and upload to Canvas.